This handout covers the topics that we will go through during the lab session.
There are two main sections in this handout, required and optional.
In the required section, we are going to set up R and RStudio on your computer. Learn how to use R packages and create R Markdown files. Then, we are going to go through the basics of R programming by analyzing a simple biological data set. Finally, we conclude the required section by discussing the advantages of using R for data analysis comparing to Excel.
In the optional section, we are going to run Linux operating system on a virtual machine.
Go to https://www.r-project.org/.
Click download R in the Getting Started section.
R-3.4.3.pkg. Open the downloaded file to install.base, then Download R 3.4.3 for Windows. Open the downloaded file to install.gdebi utility, and you can find more information at https://medium.com/@GalarnykMichael/install-r-and-rstudio-on-ubuntu-12-04-14-04-16-04-b6b3107f7779.Note on 32-bit and 64-bit: Most of the time, we only need to know this information when trying to install or run some software. 32-bit or 64-bit refers to the size of register in the central processing unit (CPU). 32-bit system allows direct access of \(2^{32}\) bytes of memory, and 64-bit allows direct access of \(2^{64}\) bytes of memory.
Console tab in it and select that tab, which is the left pane by default if you have not opened any script yet. The console runs an R programming language interpreter, which executes the R code for you.Select the Console pane by clicking on its inner area. You should see a cursor blinking after the prompt >.
Type print("Hello World!") and hit enter. You should see the following prompt in the next line [1] "Hello World!". An R base function print() is called with the positional parameter "Hello World". After this line of code is executed, the parameter is printed by the computer to the console.
Even though this line of code is simple, there are actually many things happened behind the scene:
RStudio starts running. RStudio is an integrated development environment (IDE) for R programming language. RStudio is written mainly in Java, C++, and JavaScript. You can look at their source code stats if you are interested by going to the following link, https://github.com/rstudio/rstudio.
Console tab. The R programming language interpreter is basically an infinite loop with memory of executions within the same session:
>.> again and ready for the next code expression. However, if one or more variables are created or updated, the interpreter will remember them.You typed in print("Hello World!") and “enter” in the R interpreter.
When R interpreter sees the “enter”, it checks your code before the “enter”. If you wonder why this is necessary, try type print("Hello World!" and “enter”. Note that the ) is missing at the end. Then, at the next line with the prompt +, type ) and “enter”. The ) in the second line completes the code expression, and the R interpreter starts executing the code expression.
R interpreter found a complete code expression and starts executing the code without the “enter”.
R interpreter starts executing print("Hello World!"), which is a function to print data out. At this point, the prompt [1] "Hello World!" is printed out. Question: why is there [1] before "Hello World"? Hint: try to run print(1:100), where 1:100 creates a vector of integers from 1 to 100.
print("Hello World!") function returns "Hello World!", but there is no variable assigned to that return value. Try to run myVariable <- print("Hello World!") and see what the value is stored in myVariable. You can see it either by typing myVariable and “enter” in the console, or find it in the Environment tab of another pane. Environment is basically a collection of variables.
Sys.sleep(10) which asks the interpreter to idle for 10 seconds. You should see a blinking cursor at the beginning of the line. Try:
print("Hello World!") and “enter”.packagesAn R package is essentially a collection of code for certain purposes, such as performing differential expression analysis (DESeq2), convert R markdown to HTML/PDF/Word (Knitr), and general plotting (ggplot2).
In order to better share the code, an R package usually contains the following things as well:
In this section, we will go through the following:
install.packages() function.biocLite() function.library() function.libraryName::functionName syntax.To install from CRAN, run the following code:
install.packages("ggplot2")
install.packages("dplyr")
If you are calling it for the first time, you might be prompted to select the mirrors, which is the same as downloading R programming language. You can just choose the same one as above http://cran.r-project.org.
install.packages() is the function to install package. The first parameter "ggplot2" is the name of the package to be installed.
To install from CRAN, run the following code:
# source() is a function to run an R program.
# source("https://bioconductor.org/biocLite.R"), runs the script at
# https://bioconductor.org/biocLite.R , which defines the
# function `biocLite()`.
#
# You can use your browser to see the code by opening the link.
source("https://bioconductor.org/biocLite.R")
# Install core packages
biocLite()
# Install a specific package named DESeq2
biocLite("DESeq2")
Note: # starts a comment, # and everything afterwards within the same line are not going to be executed.
For both CRAN and Bioconductor packages, the installing process of a package includes basically two steps:
.libPaths().You might wonder why couldn’t we just have one repository. Just list a few reasons:
CRAN and Bioconductor developer communities are distinct. From my impresion, the CRAN community focus more on general usage such as plotting and data manipulations, whereas the Bioconductor community focus more on analyzing a specific type of biological data.
R packages are peer-reviewed by the community. A general purpose package developer might not be able to review a package specifically designed for biological data.
Publishing rules of Bioconductor are more strict than CRAN, which guarentee certain minimum level of package quality.
The code for using CRAN and Bioconductors is the same. You can either import the whole package to your environment using function library(), or execute certain function within a package using the packageName::functionName syntax.
# Run functions using the :: operator
# NOTE: ggplot2::mpg refers to the dataset `mpg` in the package `ggplot2`
ggplot2::ggplot(ggplot2::mpg, ggplot2::aes(displ, hwy, colour = class)) +
ggplot2::geom_point()
# Import an installed CRAN package
library("ggplot2")
# Import an installed Bioconductor package
# Note that the quotes are optional
library(DESeq2)
# Run imported functions from ggplot2
# You will get an error if you run this without importing ggplot2
ggplot(mpg, aes(displ, hwy, colour = class)) +
geom_point()
You might wonder why we still need to import a library after installing it. The intuition is an analogy to installing a software on your computer. After you install a software, such as RStudio, you might not want to open it and keep it constantly running at the backgorund after starting your computer, because that will slow down your computer and make your working space crowded. When R interpreter starts running, it only has a minimum set of data and functions defined, which is similar to a just started computer. If you want to use the R interpreter to do something else, you need to get those data and functions into your environment, which is similar to opening an application, such as a web browser, on your computer.
However, this analogy does not explain the packageName::functionName syntax, and we need more details to understand what really happened.
Recall that a package is a collection of code structured into a directory located in the R library on your computer. Following is a screen shot of one R library location:
Each folder is an installed package with the folder name of corresponding package name, and the ggplot2 package is the first one in the list. The content of ggplot2 is structured within the folder. Following is the screenshot of the files within the ggplot2 folder:
All packages are structured in the same way but with different content in these files. The code of a package is stored in the R directory, and the datasets are in the data directory. Other files and directories contain documentations, license, citation, etc.
Therefore, using a package, either through importing or :: operator, is to ask R interpreter to use the stored code or datasets. However, there are important distinctions between these two methods.
Importing a packages loads all code and datasets into your R interpreter, so that you can call the package defined functions and use its datasets directly. However, importing will not execute the loaded code.
On the other hand, packageName::functionName does not load anything into your R interpreter but just execute the function. After the executed function returned, the function will not be loaded.
Because R interpreter starts with only a minimum set of packages imported, if you want to use a function or dataset within a package that has not been imported, you can either import that package and use it directly afterwards, or use it through :: operator.
You can get help from two types of documentations, tutorial (vignette) and reference manual. A tutorial helps a beginner to quickly get started using the package, whereas a reference manual lists all the details about the code and datasets included in the package for a more experienced user to look up.
You can get both tutorial and manual either using RStudio or online.
There are R functions to look up the package documentations in your library (recall the figure of package structure). The function vignette("packageName") looks up for tutorial, and help("functionName") looks up for the manual section of that function. After the function being executed by your R interpreter and the specified documentation was found, the documentation will show up in the help tab in one of your RStudio panes. You will receive an error in the console tab if the specified documentation was not found.
Following is a demo of how to use vignette() and help() functions.
# I want to see the the vignette of DESeq2
# This does not require DESeq2 to be imported
vignette("DESeq2")
# I want to get help on the function DESeq() in DESeq2
# This requires DESeq2 to be imported. If we did not import DESeq2 previously,
# you will receive an error message.
help(DESeq)
# However, if you do not want to import DESeq but just want to see the documentation
# of DESeq(), you can use the following code
help(DESeq, package = "DESeq2")
# If you want to know more about these two functions, check their manuals
help(vignette)
help(help)
Google the package and go to either CRAN or Bioconductor depending on the package source. Find the documentations listed in the webpage. Following are the screenshots of the location of the documentation:
![]()
ggplot2documentations at https://cran.r-project.org/web/packages/ggplot2/index.html
![]()
DESeq2documentations at https://bioconductor.org/packages/release/bioc/html/DESeq2.html
The first documentation of DESeq2 on Bioconductor Analyzing RNA-seq data with DESeq2 is the tutorial.
Documentations are usually very helpful, but also keep in mind that Google is also your best friend when learning programming. There are many helpful Q&As on StackOverflow and Biostars. StackOverflow community is focused on general programming and statistics, whereas Biostars community is focused on bioinformatics.
R Markdown is a markup language for formatting an analysis report written in R, which allows you to embed your R analysis code into a document explaining your analysis.
The R package knitr converts R Markdown files into an easy-to-read report in HTML, PDF, or Word format.
To start using R Markdown in RStudio:
Click menu File -> New File -> R Markdown. Then, you will be prompted by a wndow to fill in some relevant parameters. In the Default Output Format: option, select HTML (important), which is the standard markup language for creating web pages. It worth taking time to read the explanations of each option. After clicking OK, a new R Markdown file will show up in a pane of RStudio.
Click the Knit option at the top of the Markdown file pane. Then, you will be prompted to save the R Markdown file. If you hit Knit for the first time, you will be prompted to install several packages that are required to run the “knitting” process. Install them.
test and save it anywhere you prefer. Then, you will see an R Markdown tab show up in the Console pane, and it starts running. When running, knitr executes all R code and generates an HTML report. After it is done, a new window containing the generated HTML report will show up. The report file will also be stored in the same directory as the R Markdown file.If you want to generate your report in PDF, you will need to install a extra tool on your computer outside of RStudio. This tool, TeX, helps knitr to convert R Markdown into PDF. In order to get this tool, click the tiny triangle at the right side of Knit option and select Knit to PDF, and you will receive an error message in the R Markdown tab. Follow the instructions in the error message carefully to install the TeX distribution.
We will expore the features of R Markdown in the next section.
In this section, we will go through the basics of R programming language by analyzing a simple and fake biological dataset. The dataset is stored in the file cell_exp_proliferation.csv within the lab 1 handout. You can open it with a text editor or Excel.
The dataset is collected in an experiment to study the correlation between the expression level of a gene and the proliferation rate of a cell line. In the hypothesis, this gene is able to decrease the proliferation rate of the cells. This dataset includes the expression levels of the gene in wild type, knock down, and overexpressed cells and their proliferation rates. Each condition has 10 biological replicates. Following is the data contained in cell_exp_proliferation.csv.
| Proliferation.Rate | Expression.Level | Condition |
|---|---|---|
| 10.422538 | 2.912106 | Knock Down |
| 10.624962 | 1.672114 | Knock Down |
| 14.845919 | 2.090984 | Knock Down |
| 20.419040 | 2.594304 | Knock Down |
| 17.045714 | 2.035436 | Knock Down |
| 12.541448 | 2.829719 | Knock Down |
| 11.525131 | 2.547715 | Knock Down |
| 10.567788 | 1.700726 | Knock Down |
| 13.761701 | 1.954836 | Knock Down |
| 14.967113 | 2.113366 | Knock Down |
| 5.569469 | 14.036798 | Wild Type |
| 9.017781 | 8.017283 | Wild Type |
| 10.352420 | 8.161492 | Wild Type |
| 15.831499 | 12.402427 | Wild Type |
| 8.375767 | 11.738877 | Wild Type |
| 10.080793 | 4.734751 | Wild Type |
| 7.790262 | 11.745099 | Wild Type |
| 9.461258 | 15.078069 | Wild Type |
| 13.532137 | 3.907414 | Wild Type |
| 9.121692 | 12.117747 | Wild Type |
| 5.365265 | 24.108622 | Overexpression |
| 4.914866 | 24.470910 | Overexpression |
| 4.368023 | 23.133997 | Overexpression |
| 10.246739 | 26.334096 | Overexpression |
| 3.900205 | 22.750532 | Overexpression |
| 10.285317 | 22.365098 | Overexpression |
| 7.091778 | 20.609853 | Overexpression |
| 7.256476 | 17.658962 | Overexpression |
| 2.886242 | 21.826452 | Overexpression |
| 4.848592 | 27.436294 | Overexpression |
This analysis includes the following steps:
Create an R project.
Create an R Markdown file.
Load the dataaset into R.
Output plots and tables.
Important: The things that you need to do are marked with ACTION.
ACTION: Create a new R project: Click menu File -> New Project -> Existing Directory -> select the directory of lab 1 and 2 handout, which contains cell_exp_proliferation.csv, as the project working directory (very important) -> create.
An R project is a session of RStudio, and the working directory is the directory where R interpreter starts running.
ACTION: After the project is created, type dir() and hit “enter” in your Console tab, you should see "cell_exp_proliferation.csv" included in the output together with some other files. The function dir() lists all files under the working directory of R.
ACTION: Create a new R Markdown file as described above, click menu File -> Save, and save it under lab 1 and 2 handout directory (very important).
The first section is your header, which is as following:
---
title: "Your Title"
author: "Your Name"
date: "Some Date"
output: html_document
---
Then it comes the first R code chunk named setup.
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
An R code chunk is in the following format:
```{r Your Code Chunk Name, yourCodeChunkParameters=yourParamterValue}
# Your R code to be evaluated by R interpreter.
print("Hello World!")
```
Only code in the code chunks will be executed. Everything outside code chunks will be formated by knitr as plain text.
ACTION: Delete everything below this chunk, and we are going to create our own analysis R Markdown file.
ACTION: Add a level 1 header below the setup code chunk to notify the start of the data loading section by adding the following text:
# Load dataset into R environment
The # at the beginning indicates that the line is a level one header. When knitr sees the # it will format it as the level 1 header in the output report. Similarly ## indicates a level 2 header, ### indicates a level 3 header, etc. More information about the formatting conventions can be found at http://rmarkdown.rstudio.com/authoring_basics.html.
ACTION: Add the following code chunk into the created R Markdown file to load the cell_exp_proliferation.csv into your environment:
```{r load dataset}
# Load the data of cell_exp_proliferation.csv into the variable exp_prolif_df
# 'df' in the end indicates that the variable is a dataframe
# stringsAsFactors=FALSE indicates that we do not want to convert strings into factors
exp_prolif_df <- read.csv("cell_exp_proliferation.csv", stringsAsFactors=FALSE)
# Print the data stored in variable exp_prolif_df
exp_prolif_df
```
The dataset is stored in a file named cell_exp_proliferation.csv under the same directory as the handout. A comma separated values (CSV) format is commonly used to store a table in plain text. CSV file can be opened by text editor, Excel, and RStudio. The units of proliferation rate and expression level are both arbitrary.
Each row is a biological replicate. The columns are the proliferation rate, expression level, and condition.
In this code chunk, we loaded cell_exp_proliferation.csv as a dataframe into R environment. “R environment” has been mentioned many times in this handout, and you might wonder what it really is.
An R environment is associated with each R interpreter session, and it includes all variables that you can access in that R interpreter session.
A important behavior of knitr is that when knitting an R Markdown file (after clicking Knit), knitr starts a brand new R interpreter session and evaluates all code chunks from beginning to the end. As a result, knitr will not be able to access the variables in the console R interpreter environment.
In the above figure, the R interpreter console session cannot access the variable myVariable in the knitr session, vice versa.
In the code chunk above, after the interpreter executes exp_prolif_df <- read.csv("cell_exp_proliferation.csv", stringsAsFactors=FALSE), its environment will have an additional variable named exp_prolif_df. Then, we can print the values stored in that variable in that interpreter session.
However, you cannot print exp_prolif_df in your console session even after knitting the current document, because knitr starts a new session when knitting an document.
Similarly, if you import a library in an interpreter session, all code and datasets will be imported to its associated R environment, and you can use the imported code and datasets in that interpreter session. However, if you quit the current interpreter session and starts a new one, you will need to import that package again before using its code and datasets. Recall that when an R interpreter session started, only a minimum set of packages are imported into its environment.
In addition to the functions and datasets, an R environment also keeps track of many system parameters such as the library location, and the most important one is the current working directory (CWD). You might wonder why it is very important to create the R project and R Markdown files in the lab handout directory, and the reason is that the locations of the R project and R Markdown file are the CWDs of the Console R interpreter session and knitr interpreter session respectively.
If your CWD does not include cell_exp_proliferation.csv, you will receive an error when running exp_prolif_df <- read.csv("cell_exp_proliferation.csv", stringsAsFactors=FALSE), because R interpreter cannot find the file cell_exp_proliferation.csv.
You can change your CWD by clicking menu Session -> Set Working Directory -> Choose Directory.
The locations of files on your computer are specified by their paths.
A path describes the exact method to find your file. For example, if you want to find a file named myFile.txt under a directory called myDirectory under C disk directory in Windows, you need to open C disk and open myDirectory, and you will see myFile.txt. The path of this file is C:\myDirectory\myFile.txt .
Paths separate different files (folder is a special kind file) that you need to open by / in MacOS/Linux and by \ in Windows.
Following are some example paths on Windows, Mac, and Linux.
Windows path of a file named very_important_file.txt under the C disk directory:
C:\very_important_file.txtWindows path of a file named not_important_file.txt under a directory called outdated_files under C disk directory:
C:\outdated_files\not_important_file.txt Mac path of file named very_important_file.txt under your Documents directory:
/Users/yourUserName/Documents/very_important_file.txtMac path of file named not_important_file.txt under a directory called outdated_files under your Documents directory:
/Users/yourUserName/Documents/outdated_files/not_important_file.txtLinux path of file named very_important_file.txt under your Documents directory:
/home/yourUserName/Documents/very_important_file.txtLinux path of file named not_important_file.txt under a directory called outdated_files under your Documents directory:
/home/yourUserName/Documents/outdated_files/not_important_file.txtThe way how knitr R interpreter finds cell_exp_proliferation.csv is through a path relative to CWD. read.csv("cell_exp_proliferation.csv") asks R interpreter to find cell_exp_proliferation.csv under CWD. You can check your CWD path by typing getwd() in Console tab and hit “enter”. If a path is specified without / on MacOS/Linux or DiskLetter:\ on Windows, it will be prepended with the path of CWD to form an absolute path, and R interpreter will further look up that absolute path. If the file specified by the absolute path does not exist, an error message will be prompted stating that the file is not found.
Following is an example of how R failed to find cell_exp_proliferation.csv:
/home/youUserName/Documents/someOtherDirectory, and this directory does not contain the file cell_exp_proliferation.csv.read.csv("cell_exp_proliferation.csv") on the R session: cell_exp_proliferation.csv.read.csv("cell_exp_proliferation.csv") on the R session: /home/youUserName/Documents/someOtherDirectory/cell_exp_proliferation.csv./home/youUserName/Documents/someOtherDirectory/cell_exp_proliferation.csv does not exist, an error message will be prompted.There are two special cases of relative paths:
. refers to the CWD... refers the directory at the immediate upper level of CWD.For example, when CWD is /home/youUserName/Documents/someOtherDirectory:
. is the same as CWD: /home/youUserName/Documents/someOtherDirectory.. is the immediate upper level directory of CWD: /home/youUserName/Documents../someOtherDirectory is the same as CWD../. is the same as CWD.Important note on specifying a path in R on Windows:
If you are running R interpreter on Windows, and you specified a path using a string, you have to use two backslashes \\ in an R string for one backslash \ in its corresponding path. For example, the absolute path of a file is C:\aFile.txt, and the R string to represent this path is "C:\\aFile.txt". This is because R treats backslash \ as a special character in string, so we have to use \\ to represent the real character \ in R strings. There are many rules listed in the following website, https://stat.ethz.ch/R-manual/R-devel/library/base/html/Quotes.html . You do not have to read this now. Use this website as a reference.
Alternatively, replace \ with / in the R string representing path, i.e. "C:/aFile.txt", will also work. This is one of the ad hoc but convenient features in R programming language. See the following post fore more details, https://stackoverflow.com/a/14335218/4638182 .
We are going to do the following two analyses:
Check linear correlation between expression level and proliferation rate using linear regression.
Test whether the means of expression levels are significantly different between conditions using t-test.
Test whether the means of proliferation rate are significantly different between conditions using t-test.
ACTION: Add a level 1 header to notify the start of data anlysis section:
# Analyze the loaded dataset
ACTION: Add a level 2 header to notify the start of linear correlation analysis:
## Check linear correlation between gene expression level and cell proliferation rate
ACTION: Add the following code to create a scatter plot of expression level versus proliferation rate:
```{r plot linear correlation}
# Import the package ggplot2
library(ggplot2)
# Create a scatter plot of expression level versus proliferation with linear regression line
lin_cor_plot <- ggplot(data = exp_prolif_df,
mapping = aes(x = Proliferation.Rate,
y = Expression.Level)) +
geom_point(aes(color = Condition)) +
geom_smooth(method="lm", se = FALSE) +
labs(x = "Proliferation Rate", y = "Expression Level", color = "Condition")
# Display the plot. The function call print() is optional.
print(lin_cor_plot)
```
ACTION: Knit the document by clicking Knit.
In this code chunk, we imported ggplot2 and created a scatter plot using it. We will talk more about making plots using ggplot2 in later lab sessions. The code is explained line by line as the following:
lin_cor_plot <- ggplot(data = exp_prolif_df, mapping = aes(x = Proliferation.Rate, y = Expression.Level)):
,” to specify more parameters in a function call, 2) change line after an operator such as +, 3) change line after { when defining a function and writing a loop.lin_cor_plot <- ggplot(): The function ggplot() is called to make a new plot and store that plot in variable lin_cor_plot.data = exp_prolif_df, specifies that the dataset we are plotting is variable exp_prolif_df.mapping = aes(x = Proliferation.Rate, y = Expression.Level) specifies that we want to plot the column Proliferation.Rate in x-axis and the column Expression.Level in the y-axis.+ at the end of line, which indicates that the plotting is not ended yet.geom_point(aes(color = Condition)):
geom_point(): make a scatter plot.aes(color = Condition): color the points according to the Condition column.geom_smooth(method="lm", se = FALSE):
geom_smooth: add a regression line to the scatter plot.method = lm: use linear regression. lm refers to the function lm() that is used to fit a linear model.se = FALSE: do not plot the confidence intervals of the linear regression.labs(x = "Proliferation Rate", y = "Expression Level", color = "Condition")
labs(): label the plot.x = "Proliferation Rate": label x-axis as Proliferation Rate.y = "Expression Level": label y-axis as Expression Level.color = "Condition": label the color of points as condition.+ at the end of the line, this line concludes the code expression to create the plot.ACTION: Add the following code chunk to print out the statistics of the linear regression:
```{r print linear correlation stats}
# Print the statistics of linear regression
summary(lm(Expression.Level ~ Proliferation.Rate, data = exp_prolif_df))
```
ACTION: Knit the document by clicking Knit.
Code explanation:
summary(lm(Expression.Level ~ Proliferation.Rate, data = exp_prolif_df)):
summary(): summarize lm(Expression.Level ~ Proliferation.Rate, data = exp_prolif_df)lm(): fit a linear model.Expression.Level ~ Proliferation.Rate: the formula of the linear model, which specifies that we want the linear model to be Expression.Level column versus Proliferation.Rate column.data = exp_prolif_df: the dataset is variable exp_prolif_df.The p-value of the linear regression will be printed in the last line of the linear model stats.
The purpose of this analysis is to validate that the gene is knocked down or overexpressed.
ACTION: Add a level 2 header to notify the start of expression level comprisons:
## Compare gene expression levels between different conditions
ACTION: Add a code chunk to create a boxplot of the expression levels of all conditions:
```{r plot gene exp}
# Create boxplot of gene expression levels
exp_boxplot <- ggplot(data = exp_prolif_df,
aes(x = Condition, y = Expression.Level,
color = Condition)) +
geom_boxplot() +
geom_dotplot(aes(fill = Condition), binaxis='y',
stackdir='center', binwidth = 1, dotsize = 0.75) +
labs(x = 'Condition', y = 'Expression Level')
exp_boxplot
```
ACTION: Knit the document by clicking Knit.
Each box in the plot refers to a condition, and the dots are the measured expression levels stacked into bins.
Code explanation:
ggplot(data = exp_prolif_df, aes(x = Condition, y = Expression.Level, color = Condition)) +: similar as previously mentioned.geom_boxplot() +: make a boxplotgeom_dotplot(aes(fill = Condition), binaxis='y', stackdir='center', binwidth = 1, dotsize = 0.75) +:
geom_dotplot(): make a dotplot on top of the box plot.aes(fill = Condition): fill the dots according to their conditions.labs(x = 'Condition', y = 'Expression Level'): similar as previously mentioned.ACTION: Add a code chunk to test the difference between each condition using t-test.
```{r test exp mean diff, message=FALSE}
library(dplyr)
# Select the expression levels of each condition and store them into individual variables.
wt_exp <- exp_prolif_df[exp_prolif_df$Condition == 'Wild Type', 'Expression.Level']
kd_exp <- subset(exp_prolif_df, Condition == 'Knock Down', select = 'Expression.Level', drop = TRUE)
oe_exp <- filter(exp_prolif_df, Condition == 'Overexpression') %>% pull(Expression.Level)
# knock down versus wild type
t.test(x = kd_exp, y = wt_exp, alternative = 'l')
# wild type versus overexpression
t.test(x = wt_exp, y = oe_exp, alternative = 'l')
# knock down versus overexpression
t.test(x = kd_exp, y = oe_exp, alternative = 'l')
```
ACTION: Knit the document by clicking Knit.
The p-values will be printed out.
Code explanation:
We first selected the expression levels of the three conditions using three different methods:
wt_exp <- exp_prolif_df[exp_prolif_df$Condition == 'Wild Type', 'Expression.Level']: This line of code used boolean and direct indexing.
myDataFrameOrMatrix[rowIndicesToSelect, columnIndicesToSelect] .exp_prolif_df$Condition == 'Wild Type': boolean indexing. We want the rows of exp_prolif_df where the column Condition values are equal to the string "Wild Type". exp_prolif_df$Condition selects the Condition column of exp_prolif_df.'Expression.Level': column name indexing.kd_exp <- subset(exp_prolif_df, Condition == 'Knock Down', select = 'Expression.Level', drop = TRUE): This line of code used the function subset(). Check the documentation of subset to look for more details.
oe_exp <- filter(exp_prolif_df, Condition == 'Overexpression') %>% pull(Expression.Level): This line of code used the package dplyr. Go to http://dplyr.tidyverse.org/index.html for more details. The operator %>%, which is defined in the package dplyr, passes the return value of previous expression into the parameter of the next function call. For example, "Hello World!" %>% print() is equivalent to print("Hello World"). The main purpose of introducing the operator %>% is to make R code easier to read.
Then, we used t-test to test for the differences of expression level between conditions.
t.test(x = wt_exp, y = oe_exp, alternative = 'l')
x = wt_exp, y = oe_exp: compare values in wt_exp and oe_exp.alternative = 'l': alternative hypothesis of this test is that the mean of wt_exp is less than oe_exp.t.test for more details.When performing t-test, we selected the expression levels and stored them into differrent variables. This procedure belongs to “data manipulation”. Followiwng is a brief introduction to data manipulation, and we are going to cover more of it in future lab sessions.
Data manipulation is a very general practice when doing data analysis, in which data in one form are turned into another form in order to complete the following analysis. For example, in this analysis, we have to extract the expression values from exp_prolif_df to complete t-tests.
There are basically 3 ways of data manipulation:
Select or filter: select certain data points of interest from a bigger dataset.
Summarize: calculate summary statistics, such mean and standard deviation, of multiple data points.
Reshape: change the “design” of a table. There are two types of table “design”, “wide” and “long” format. One format can be “reshaped” into another using packages like reshape2 and tidyr.
“Wide” table: this “design” exploits the relationship between your data, where the expression level of same set of genes are measured in three cell lines. With this relationship, you do not need to create three individual tables with only one gene or one cell line in each table. You can think of this design as “relational” format, which is not an exact technical term.
| geneName | expressionInCellLineA | expressionInCellLineB | expressionInCellLineC |
|---|---|---|---|
| GeneA | 1 | 20 | 5 |
| GeneB | 5 | 5 | 20 |
| GeneC | 10 | 15 | 30 |
“Long” table: this “design” intends to make a “dictionary” of the data points (expression levels). If you want to look up for the expression level of GeneA in cell line A, find the rows with geneName column containing value GeneA, and then find the row with cellLine column containing value expressionInCellLineA. Then, you find the value you want. You can think of this design as “key-value store” format, which is also not an exact technical term. The “key” refers to the unique identifier of a data point, and “value” refers to the value of that data point.
| geneName | cellLine | expressionLevel |
|---|---|---|
| GeneA | expressionInCellLineA | 1 |
| GeneB | expressionInCellLineA | 5 |
| GeneC | expressionInCellLineA | 10 |
| GeneA | expressionInCellLineB | 20 |
| GeneB | expressionInCellLineB | 5 |
| GeneC | expressionInCellLineB | 15 |
| GeneA | expressionInCellLineC | 5 |
| GeneB | expressionInCellLineC | 20 |
| GeneC | expressionInCellLineC | 30 |
An analysis might be more easily done if the data is in either wide or long format. For more details, check this introduction to the R package reshape2, http://seananderson.ca/2013/10/19/reshape.html .
NOTE: I strongly encrouage you to write the headers and code chunks of this section yourself before using the code we provide below, because this analysis is the same as the previous section on expression level analysis. Hint: You can make a copy the previous section and change all parameters or variable names related to expression level into proliferation rate.
ACTION: Add a level 2 header to notify the start of proliferation rate comprisons:
## Compare proliferation rates between different conditions
ACTION: Add a code chunk to create a boxplot of the proliferation rates of all conditions:
```{r plot proliferation rate}
# Create boxplot of proliferation rates
prolif_boxplot <- ggplot(data = exp_prolif_df,
aes(x = Condition, y = Proliferation.Rate,
color = Condition)) +
geom_boxplot() +
geom_dotplot(aes(fill = Condition), binaxis='y',
stackdir='center', binwidth = 1, dotsize = 0.75) +
labs(x = 'Condition', y = 'Proliferation Rate')
prolif_boxplot
```
ACTION: Knit the document by clicking Knit.
ACTION: Add a code chunk to test the difference between each condition using t-test.
```{r test proliferation rate diff}
# Select the proliferation rates of each condition and store them into individual variables.
wt_prolif <- exp_prolif_df[exp_prolif_df$Condition == 'Wild Type', 'Proliferation.Rate']
kd_prolif <- subset(exp_prolif_df, Condition == 'Knock Down', select = 'Proliferation.Rate', drop = TRUE)
oe_prolif <- filter(exp_prolif_df, Condition == 'Overexpression') %>% pull(Proliferation.Rate)
# knock down versus wild type
t.test(x = kd_prolif, y = wt_prolif, alternative = 'g')
# wild type versus overexpression
t.test(x = wt_prolif, y = oe_prolif, alternative = 'g')
# knock down versus overexpression
t.test(x = kd_prolif, y = oe_prolif, alternative = 'g')
```
ACTION: Knit the document by clicking Knit.
Finally, we output the plots and tables for the record.
ACTION: Add a level 1 header to indicate the start of output section.
# Output results and plots
ACTION: Add a code chunk to save the plots we made.
```{r output plots, results='hide'}
# Save plot in a PDF file named cell_prolif_exp_plots.pdf
# You can also output your plots **individually** in other formats using other
# functions, such as bmp(), jpeg(), and png(). Check the documentation of the
# png() function for more details.
pdf('cell_prolif_exp_plots.pdf')
# Now, the following plots will be printed into cell_prolif_exp_plots.pdf .
lin_cor_plot
exp_boxplot
prolif_boxplot
# dev.off() tells R that we are done saving plots into cell_prolif_exp_plots.pdf .
dev.off()
```
ACTION: Knit the document by clicking Knit.
You should see a PDF file named cell_prolif_exp_plots.pdf generated in your CWD. You can open it using your file browser.
ACTION: Add a code chunk to calculate the mean and standard deviation of the data and output it into a table.
```{r output stats, results='hide'}
# Calculate summary statistics of all conditions
exp_prolif_stats_df <- exp_prolif_df %>%
group_by(Condition) %>%
summarise(expLevelMean = mean(Expression.Level),
expLevelSd = sd(Expression.Level),
prolifRateMean = mean(Proliferation.Rate),
prolifRateSd = sd(Proliferation.Rate))
# Output the summary statistics table
write.csv(exp_prolif_stats_df, 'exp_prolif_stats.csv',
row.names = FALSE)
```
ACTION: Knit the document by clicking Knit.
You should see a CSV file named exp_prolif_stats.csv generated in your CWD. You can open it using text editor or Excel.
This concludes our data analysis.
There are many reasons why you might want to use R for your data analysis rather than Excel, and I will only compare their reproducibility, efficiency, and feasibility. There are definitely disadvantages of using R as well, such as time consuming to learn, but I will not elaborate on them
R Markdown analysis is much easier to be reproduced, because the same analysis can be done by anyone, including yourself after several months, with the R Markdown file and dataset by clicking Knit, and they are guaranteed to have the same results as yours. However, if someone wants to reproduce your data analysis using Excel, that person needs to click and type through the whole analysis, and errors might be introduced during analysis, which might fail that person to reproduce your results.
Using R Markdown is also more efficient than Excel:
There are numerous ready-to-run R packages for you to use when analyzing your data. However, when using Excel, you might need to type a a lot of complifated formulas.
When using R Markdown, you can write the analysis report during analysis, and everything will be formatted by Knitr. However, when using Excel, you need to write your report in Word or some other tools, which takes some time for formatting your tables and plots.
When using R Markdown, you can try out your random analysis idea fairly quickly.
Using R Markdown, you can validate the correctness of you analysis more easily, because the code are written sequencially. However, when using Excel, the order of your operations cannot be easily inferred from the spreadsheets, and you might have to go through each entry to look for your formula.
When analyzing some datasets, it is simply not feasible to use Excel. When the the size of data you are analyzing is more than 500MB, which is not very uncommon in biological data analysis, it might take several miniutes for simply one operation in Excel, such as open, save, copy, paste, and delete. When using R, the same operations are much faster. There are also performance limits of R, but it is more rarely to reach those limits. When you reach the performence limits of R, it will be almost impossible to do the same analysis in Excel.
The optional topics will only be discussed in the lab session if it ends too early. The topics are selected by the TAs to help you better understand programming and statistics. If you are not interested in any optional topic, you can safely ignore it.
In this section, we will discuss the pros and cons of using Linux operating system. If you are interested, we are going to create a virtual machine on your computer to run Linux. Don’t worry, we are not going to ask you to erase all data on your hard drive.
Linux operating system is developed based on the Unix operating system, so as the Mac OS. Following is the pedigree of Unix based operating system. Image credit: https://en.wikipedia.org/wiki/MacOS:
On the other hand, Microsoft Windows is not based on Unix.
You might wonder what is the difference between Linux, Mac OS, and Windows, because they all look the same. Even when you program using an IDE, such as RStudio, things might look the same to you as well. Thus, why bother switching to Linux.
Firstly, it is not very easy to run some bioinformatics tools on Windows. For example, a popular aligner STAR only provides ready-to-run executable files for Linux and Mac OS. If you would like to run it on Windows, you need to compile the executable from source code, which is sometimes very difficult.
In addition, bash script, https://en.wikipedia.org/wiki/Bash_(Unix_shell), is very popular among bioinformaticians to do some simple data analysis, but it can only be easily run on Unix based systems. You might find answers to your programming questions written in bash script, such as the following one on BioStars https://www.biostars.org/p/193471/. The answers are all bash scripts.
Most importantly but might not be very familiar to you, the high performance computing servers or clusters are mostly running Linux operating system. If you want to use these resources, you need to learn bash script. It is also very necessary to be able to use these resources in some cases. For example, you have 50 human RNA-Seq raw data set from your sequencer, which is about 250GB of data, and you want to align them back to the reference genome. It might take days even weeks to complete this on your own computer. However, if you run them parallelly on a server or cluster using 50 high performance CPUs and a large amount of memory, it can be done within several hours.
Following is a list of a few general advantages and disadvantages of Unix based systems,
We can easily prevent wrecking our real operating system by using a virtual machine. If you mistakenly wrecked the virtual machine, just delete everything in it and create a new one. Therefore, if you have already been using a Unix based operating system, you can still benefit from running another one on virtual machine.
A virtual machine is a guest operating system (OS) running on the host OS through virtualizing the hardware resources. When you start your computer, the host OS runs on the hardware of your computer, such as CPU, memory, and network card. When using virtual machine monitor (VMM), you are able to run a guest OS of any kind on your host OS. The VMM uses software resources provided by host OS to simulate a set of hardware resources for guest OS to run. In the following illustration, real and virtual hardware layers are labeled in blue, and software layers are labeled in yellow.
There are many VMM applications, and the most popular ones are the costly VMware products and the free VirtualBox.
Following are the instructions to use VirtualBox for running Linux as a virtual machine:
Download. Download VirtualBox platform package according to your operating system.Ubuntu 16.04.3 LTS image from https://www.ubuntu.com/download/desktop. Wait for the file ubuntu-16.04.3-desktop-amd64.iso to be downloaded. Important: make sure that the there is amd64 in the file name you are downloading, and it means that the image is a 64-bit version.New in the window. Fill the name with ubuntuVM. Important: make sure that the type is Linux and Version is Ubuntu (64-bit). Click Continue. Troubleshoot: If you cannot find the 64-bit option, try google for 32-bit ubuntu image and downloaded it. Then, in step 16, use the 32-bit image.Click Continue to set the memory size of the virtual machine we are creating as recommended.
Choose Create a virtual hard disk now and click Create. This creates a file on your host OS to store the files in guest OS.
VDI (VirtualBox Disk Image) and click Continue.Dynamically allocated and click Continue. It worth taking time to read through the explanation in the window.Create.Settings.Storage. Click the Empty under Controller: IDE.Choose Virtual Optic Disk File... and choose the Ubuntu image you downloaded in step 3. You should see the information updated similarly as following.Click OK.
Select the virtual machine and click Start.
Click Install Ubuntu.
Select Download updates when installing and continue.
Select Erase disk and install Ubuntu. Don’t worry. This will not erase your real hard drive. This is only going to erase the virtual disk we created.
Click Continue. Set up your time zone. Choose language. You might need to enlarge the window to click Continue.
Set username and password. Important: remember your password. Click Continue. Then, wait for the guest OS to be installed on your virtual machine.
Once the installation is done, click Restart Now. Similarly, this will only restart the virtual machine but not your real machine.
When you see the instruction to remove the installation medium and restart, remove the Ubuntu image from the virtual optic drive similarly as steps 13 - 17. By default, VirtualBox should have removed the image. Restart. Troubleshoot: if your virtual machine stucks at a black scree with a few white dots, click menu Machine -> Reset -> confirm reset. Wait for a while, and hopefully you will see the log in screen. If it does not work, try delete the created virtual machine and redo the guest OS installation steps started from step 5.
Enter the password to log in. After this step, your virtual machine is up and running.
Note: The following steps are optional but recommended, which allows you to copy and paste text between host and guest. This is usually a very useful feature.
Click the menu Device in the window running the virtual machine.
Click Install guest additions CD image.... You should see a disk show up in your virtual machine and click Run. Then type in the password you set up for your virtual machine to authenticate.
Wait for the installation to be done. Following the instruction to hit return to close the window.
Click menu Device -> Shared Clipboard -> Bidirectional. This will work after restarting the virtual machine.
Restart your virtual machine by clicking the upper right corner and select Shutdown. Then choose restart at on the left.
Firefox browser. It should work if everything went well. I have tested it on both Mac OS 10.12.6 (16G29) and Windows 10. Trouble shoot: If this does not work, try restart virtual box.alexdobin. (2014) 2018. STAR: RNA-Seq Aligner. https://github.com/alexdobin/STAR.
“Bash (Unix Shell).” 2017. Wikipedia. https://en.wikipedia.org/w/index.php?title=Bash_(Unix_shell)&oldid=816133313.
Best, Ben. (2014) 2018. Rmarkdown-Example: R Markdown Example Showing Figures & Tables with Captions, Equations, Inline R Values and References with a Zotero Library. https://github.com/bbest/rmarkdown-example.
“Bioconductor.” 2017. Wikipedia. https://en.wikipedia.org/w/index.php?title=Bioconductor&oldid=816349709.
“Bioconductor - Install.” 2018. Accessed January 7. http://bioconductor.org/install/.
“HTML.” 2017. Wikipedia. https://en.wikipedia.org/w/index.php?title=HTML&oldid=815581603.
“Markdown.” 2018. Wikipedia. https://en.wikipedia.org/w/index.php?title=Markdown&oldid=818404099.
“Markdown Basics.” 2018. Accessed January 7. http://rmarkdown.rstudio.com/authoring_basics.html.
Morandat, Floréal, Brandon Hill, Leo Osvald, and Jan Vitek. 2012. “Evaluating the Design of the R Language.” In ECOOP 2012 – Object-Oriented Programming, 104–31. Lecture Notes in Computer Science. Springer, Berlin, Heidelberg. doi:10.1007/978-3-642-31057-7_6.
“Oracle VM VirtualBox.” 2018. Accessed January 7. https://www.virtualbox.org/.
“Path (Computing).” 2017. Wikipedia. https://en.wikipedia.org/w/index.php?title=Path_(computing)&oldid=797542097.
“Spread a Key-Value Pair Across Multiple Columns. — Spread • Tidyr.” 2018. Accessed January 7. http://tidyr.tidyverse.org/reference/spread.html.
“Unix.” 2018. Wikipedia. https://en.wikipedia.org/w/index.php?title=Unix&oldid=818559267.
“Using R Markdown for Class Reports.” 2018. Accessed January 7. http://www.stat.cmu.edu/~cshalizi/rmarkdown/.